Evaluation of blood glucose level control in type 1 diabetic patients using deep reinforcement learning

Diabetes mellitus is a disease associated with abnormally high levels of blood glucose due to a lack of insulin. Combining an insulin pump and continuous glucose monitor with a control algorithm to deliver insulin is an alternative to patient self-management of insulin doses to control blood glucose levels in diabetes mellitus patients. In this work, we propose a closed-loop control for blood glucose levels based on deep reinforcement learning. We describe the initial evaluation of several alternatives conducted on a realistic simulator of the glucoregulatory system and propose a particular implementation strategy based on reducing the frequency of the observations and rewards passed to the agent, and using a simple reward function. We train agents with that strategy for three groups of patient classes, evaluate and compare it with alternative control baselines. Our results show that our method is able to outperform baselines as well as similar recent proposals, by achieving longer periods of safe glycemic state and low risk.


Introduction
Diabetes mellitus (DM) is a disease associated with abnormally high levels of blood sugar (blood glucose) due to lack of insulin (type 1 diabetes-T1D) or insulin resistance (type 2 diabetes-T2D). In 2019, approximately 463 million adults worldwide were suffering from DM, which is increasing continuously [1]. More than 1.1 million children and adolescents are living with type 1 diabetes. In addition, there are 4.2 million deaths caused by DM. T1D can produce little or no insulin, so patients must monitor their blood glucose (BG) level and manually administer insulin doses. Excessive insulin doses, may cause hypoglycemia, that is, low blood glucose levels, which can lead to short-term complications, such as drowsiness, shakiness, confusion, loss of consciousness, seizure, or even coma or death [2,3]. On the other hand, too little insulin can result in hyperglycemia, that is, high blood glucose levels and can cause long-term chronic diseases, including retinopathy, nephropathy, and neuropathy [3]. realistic simulators of the glucoregulatory systems cannot be considered fully observable. Therefore, the environment is at most a Partially Observable MDP (POMDP) and, in spite of their ubiquity in many fields, only very few recent algorithms have been developed specifically for POMDPs [23]. Moreover, POMDPs require a mapping from of true environment states to observable variables that have to be defined by the control algorithm designer [5], usually with a high degree of arbitrariness. This may have a serious influence on the learning ability of the agent because a bad choice of observable states makes state changes and associated rewards not directly related to agent actions. For instance, a delayed response to insulin is a realistic feature of the simulator but it is also related to the choice of POMDP state mapping, because a CGM reading (observation) after an insulin injection (action) does not reflect the actual change of state. Second, compared to other learning environments, there is a remarkable number of design alternatives whose influence is not clear and usually require careful trial and evaluation. Those involve the choice of the RL algorithm (agent) and its underlying neural network architecture, the tuning of agent hyperparameters, the selection of an appropriate reward function, and even the design of the action space that may be adapted to patient specific data [7,21]. For the sake of conciseness, the design choices related to the aforementioned issues are called implementation strategy in the rest of the paper. From the above discussion, it is clear that different implementation strategies can lead to effective RL-based closed-loop controls. They may result in a viable controller or not, and with widely different performance, so the implementation strategy is subject to further investigation, as it is addressed in this paper. In fact, straightforward implementations, as we discuss and show in the following sections, do not work properly. Therefore, the RL approach to control is not different to the other main approaches to the control problem, PID and MPC, in the sense that it can be considered a generic approach, with many potential different implementations which are proposed and evaluated [14,15]. In this paper, we evaluate two different reinforcement learning algorithms to control in silico blood glucose levels in T1D and compare them with other well-known alternatives, including a PID controller. We describe and discuss our implementation strategy and related problems and compare it with recent proposals using a different implementation strategy [7,21]. Our work shares with these proposals the initial premise: to train state-of-the-art Deep Reinforcement Learning (DRL) algorithms, such as Soft Actor-Critic (SAC) [24] or Proximal Policy Optimization (PPO) [25], as a T1D BG closedloop control, but it differs in several aspects of the implementation strategy that we discuss in the paper. Our results show that our strategy can effectively control BG levels, outperforming control baselines in terms of the fraction of time spent in the desired glycaemic state and risk metrics. All the source codes used for evaluation and training as well as the trained agent policies described in this paper are publicly available in our repository (https://github.com/girtel/ AIML4Diabetes).
The paper is organized as follows. Next section contains background information needed to better understand the contribution. Afterwards, we describe our implementation strategy and design methodology. Our results are described and compared with baselines in the Results section, and compared with similar proposals in the Discussion section. Finally, we provide concluding remarks.

Type 1 diabetes mellitus
Type 1 diabetes (T1D) is an autoimmune disease involving that the pancreatic islets produce little or no insulin. Insulin is an anabolic polypeptide hormone regulating carbohydrate metabolism and normal glucose level in the circulatory system. Pathogenesis of T1D is not yet known [26], but it is often found in family members who have had a history of this disease. T1D is characterized by recurrent or persistent excessive amount of glucose in blood plasma, that is, hyperglycemia, which can be diagnosed by several criteria defined by World Health Organization (WHO) [27]. Patients with this type of diabetes need to use insulin by external means to maintain blood sugar levels, usually by subcutaneous injection. There are four main types of injectable insulin: rapid-acting insulin, short-acting insulin, intermediate-acting insulin, and long-acting insulin. The insulin dose that a patient has to inject is traditionally planned by a physician according to the patient characteristics and clinical history.

Models and simulation
Currently, it is possible to experiment with virtual patient populations in simulators to determine the optimal insulin dosage before being used in patients [18,28]. Several simulators exist to research on T1D models, such as TD1_VPP [28], Dosing-RL Gym [29] or UVA/PADOVA Simulator [18]. TD1_VPP is a virtual patient population software which can generate virtual patients following single-hormone and dual-hormone mathematical models, integrating the effects of exercise in the glucoregulatory system. Dosing-RL Gym is based on an expanded version of the Bergman minimal model, which includes meal disturbances [29].
The UVA/PADOVA Type 1 Diabetes Simulator can be used as a substitute for preclinical testing of closed-loop control strategies. The simulator is developed by the Universities of Virginia (UVA) and Padova and has been accepted by the United States Food and Drug Administration (FDA) [18]. It is the most used simulator among in silico software, according to [5,30]. There are four main components of the simulation, which are depicted in Fig 1: (1) In silico patient-a model of the glucose-insulin system in a patient; (2) In silico sensor-a model of the sensor to measure BG including its error; (3) Controller-a model used to estimate the amount of insulin to maintain blood sugar; and (4) In silico pump-a model of discrete insulin delivery and subcutaneous kinetics. In this paper, we use an available open-source Python implementation of the UVA/PADOVA simulator [19], previously used in similar studies [5,7]. The simulation environment implements the OpenAI gym interface [31], which makes it compatible with several RL frameworks. It also simulates different noisy CGM sensors, insulin pumps and a random meal scheduler.

Deep reinforcement learning
Reinforcement Learning (RL) refers to a class of methods to optimize the decision process of an agent operating on a given environment. The decision process is usually considered to be a Markov Decision Process (MDP), a discrete-time stochastic control process. Formally, an MDP is defined by a 4-tuple (S, A, P, R) of states, s 2 S, actions, a 2 A, a state-transition function Pa (s, s 0 ) = Pr(s t+1 = s 0 |s t = s, a t = a), and reward function R a (s, s 0 ). The agent is the learner entity that seeks the optimal behavior and is able to perform an action a(s), which changes the state. In this change of state from s to s 0 , the agent obtains a reward r, considered as the feedback from the environment. MDP-solving algorithms employ what is called a policy, denoted as π, which is a mapping between states and actions; that is π: s ! a. Their goal is to reach an optimal policy π � , which maximizes the accumulated sum of rewards over the entire lifespan of the agent. This decision policy can be determined by the state-action function, also called Q-function, Q(s, a), which can be approximated using Deep Neural Networks (DNN). Deep Reinforcement Learning (DRL) refers to algorithms and methods that use DNN to approximate the Q-function or optimal policy.
It is commonly assumed that the MDP has a fully observable state space S, that is, that the agent has access to observations that fully represent the state of the environment. However, the observation may just be a partial representation of the underlying state. A Partially Observable Markov Decision Process (POMDP) is an extension of an MDP, where the agent cannot fully observe the system state. In that case, the MDP 4-tuple is extended with a space of observations, o 2 O, and a usually unknown and potentially stochastic function that maps the observations to true underlying states, T: o ! s. Partial observability may stem from many factors, including limited sensing capabilities or unknown environment dynamics [23]. Let us remark that the agent with partial observability cannot know which is the real state corresponding to the reward received [32]. Despite the ubiquity of POMDPs in many practical systems [23], most of the DRL algorithms assume an underlying MDP [22,25]. POMDPs are usually addressed in DRL by augmenting the observation space with the history of past observations and actions and the use of Recurrent Neural Networks (RNN) [23] in the architecture of the learning algorithm. Only recently, a few algorithms have been specifically designed to deal with POMDPs [23]. Moreover, the state transition in some environments is determined not only by agent actions, but also by exogenous stochastic input actions [33]. Efficient methods to deal with this kind of environments are discussed by Mao et al. [33] Most of the state-of-art DRL algorithms used for MDPs are based on actor-critic methods: temporal difference learning algorithms that separate representations of value functions and policies explicitly [22]. The actor selects actions in the action space, while the critic estimates the value function from the action made by the actor. They can be applied to either discrete or continuous action spaces. However, these methods show poor sample efficiency and stability convergence properties. A variety of techniques have been developed to address those problems [22,25,34]. In this paper, we have used Soft Actor-Critic (SAC), a widely used continuous-state DRL algorithm [7,24], whose policy maximizes a trade-off between the expected return and entropy, a measure of randomness in the policy, which ensures higher robustness and stability [22]. Maximum entropy policies have been shown to solve POMDP with unobserved rewards [32]. Besides SAC, we have used Proximal Policy Optimization (PPO), another popular DRL algorithm [25]. PPO ensures that its policy does not change much from the previous policy updates, leading to smooth learning and avoiding variance in training. The tradeoff between SAC and PPO is stability and sample efficiency. PPO tends to be more stable and uses more data, whereas SAC tends to be the opposite. PPO is also claimed to work well on POMDPs [35]. Both allow the use of RNNs in their architecture.

Related work
State-of-the-art control algorithms for AP systems on devices already on the market are based on PID and MPC approaches [6,11,12,14,15,36,37]. PID and MPC controllers usually work as hybrid closed loop system, requiring announcements of meal carbohydrates amount and exercise activity [37]. Two commercially available FDA-approved systems are Metronics 670G and 770G using PID [36][37][38] and Tandem Control-IQ using MPC [36,37]. With PID and MPC, most commercial products avoid hypoglycemia overnight by utilizing Predictive Low Glucose Suspend (PLGS) [37]. PLGS technology predicts glucose concentration trends, then suspends insulin delivery before hypoglycemia occurs. A PID controller is simple but have problems to adapt to meal consumption [6,7]. Several variations to the basic PID approach have been proposed, as the use of insulin feedback (IF), which improves its performance [11,12]. MPC controllers are more proactive than PID in insulin delivery by predicting BG levels, but they need a minimal compact mathematical model. Di Ferdinando et al. [14] and Borri et al. [15] use nonlinear differential difference equation (DDE) models for the endogenous insulin delivery rate (IDR), which is better accounted for in these models. Since the IDR cannot be neglected for T2DM patients, and the DDE model reproduces it accurately, MPC that use DDE usually address T2DM. Their results show that MPC provides good performance as long as a minimal compact model is available. However, as the complexity of the model increases, MPC approaches are not tractable and one has to resort to other control methods.
Among these methods, the number of data-driven models for prediction of BG in T1DM is increasing [30]. Reinforcement learning is being used in recent research works in the field of health care. For instance, the RL Q-Learning algorithm was applied on discrete action space simulation for radiotherapy to understand scenarios of tumor growth and its treatment plan [39]. RL agent-based models have been used on continuous state and action spaces problems to find cytokine therapy for sepsis, reducing mortality from 49% on average to 0.8% [40]. RL is suitable for time sequence problems, as in the glucoregulatory system. Furthermore, the agent can learn the policies without the need for labeled data as in supervised learning methods [5,16]. The ability of RL to capture food intake patterns without human input makes it a good candidate for a fully closed-loop system and more responsive and safer policies [7].
Particularly about T1DM control, a recent review discusses most of the approaches used so far in this topic [5]. This review exposes the wide variety of alternatives used in almost all the defining elements of the RL framework, such as the definition of the state space, the action space, class of RL algorithms used and the reward function, what we have called the implementation strategies. To mention a few which differ in the class of RL algorithms, in [16] a control based on RL is proposed and the value function is not approximated by a DNN, but by a quadratic function, and in [41] the value function is approximated by a Gaussian process. In both cases, robust solutions are provided, but simplified glucose models are used. On the contrary, in this paper we use a more realistic simulator which generally requires DRL to approximate the value function.
Fox et al. adopt an approach similar to the one adopted in this paper in two recent papers [7,20]. In fact, they share the basic premise with this paper: training a DRL agent for BG closed-loop control. However, they employ different implementation strategies. In their first one [20], the state space comprises the previous 24 hours of CGM samples and insulin doses at 5-minute intervals, but the action space is discrete and made of only three insulin doses. Three relatively simple DNN architectures were used to approximate the value function and the improvements over the PID baseline were not particularly noticeable. In their second work [7], the state space is also made of CGM and insulin samples but only from the four previous hours and the action state is continuous, so the SAC algorithm is used. Additionally, the reward functions differ in both cases. In contrast, in this work we use only the last CGM sample as state but take a relatively long interval of 30 to 60 minutes between observations and actions and define a simpler reward function. A hybrid model-based approach is discussed by Yamagata et al. [13], which uses a discrete action space combined with meal announcement. Finally, very recently, Lin et al. [21] proposes a combination of machine learning methods for BG control: the controller uses a DRL SAC agent which is driven by a PID control as an initial policy and, in addition, the observation state is extended by the predictions of a dual attention network. Finally, the actions are also regulated by an adaptive safe action. The results of the last three aforementioned methods [7,13,21] are further compared with ours in the Discussion section.

Implementation strategy and methodology
In this section, we describe and discuss the design process and choices for our implementation strategy for a BG closed-loop control based on DRL. Our goal is to balance blood glucose as long as possible with low risks. First, we define the state and action space and discuss the reward function. Afterwards, we conduct an initial evaluation based on naive strategies to determine the features of the environment that may have more influence on the agent learning. This leads us to refine our design and propose an implementation strategy that is evaluated in the Results section.

Analysis of environment and initial design
The simglucose simulator environment implements the UVA/Padova glucose model [42] and provides CGM sensors that sample the BG level through a noisy (stochastic) process as well as a random and patient-dependent meal schedule. From this description, it is clear that the environment should be considered a POMDP, since the CGM observations of BG levels include noise reads from sensors. In fact, the underlying hidden states of the environment, s, are given by the glucose model states [42], rather than the BG level. Moreover, the dynamics of the environment, that is, the state transitions, are not only determined by the actions taken by the agent (insulin dose injected), but also by exogenous stochastic input actions [33], such as the intake of CHO (meals) or physical exercise. However, since simglucose does not consider physical exercise, unlike T1D_VPP [28], we restrict our study to external meals.
All the above considerations have an influence on the DRL controller design. We first define the elements of the DRL algorithm for our problem as follows: Observations and state. The mapping of the observable variables may determine whether the environment is a POMDP or an MDP. As an example, the number of frames included in the observations in the Atari Pong game makes it become a POMDP or a MDP [43]. We start by using only CGM samples as observation variables, since they are readily available. Unlike the work in [7], we do not use past actions (insulin doses) in the observations. Since we target a closed-loop controller, we do not include CHO intake as part of the observation, which has to be announced by the user, even though some devices may facilitate its announcement [44]. We start by using only the current observation, given by the last CGM sample, o 2 R þ . The CGM sample frequency is three minutes per environment step.
Action. The action is the amount of basal insulin that the patient gets injected. It is a decimal number, ranging between 0 to 30 units, a 2 [0, 30], according to the specifications of the insulin pumps implemented in the simulator.
Goals and risk metrics. Safety is crucial in healthcare applications. The main goal of our BG controller is to balance the BG level for as long as possible with low health risks. A commonly used metric of risk associated with BG levels is the blood glucose risk index (BGRI), and it has also been used to measure the performance of control methods [45]. BGRI is a measure of glucose variability and associated risks and it is based on a symmetrization of the BG measurement scale [45]. The Clarke BGRI is defined as BGRI = LBGI+ HBGI, where LBGI and HBGI are computed over a series of n CGM samples as: and where LBGI and HBGI represent the risk associated with low and high BG levels. They are computed from the following function: Noted that BG is measured in mg/dL. f(BG) is the basis to calculate the BG risk function using the formula r(BG) = 10 × f(BG) 2 and separating it as low rl and high rh as follows: The curves from Clarke and Magni are shown in Fig 2. As can be seen, the BGRI adequately captures the increased risk associated with hypoglycemia for the patients.
It is common practice to define the reward function in terms of the Clarke or Magni RI [7,20]. But, as we discuss next, they may not be adequate to capture our intended goal.
Termination limits and safety. In DRL, an environment finishes when some condition is met. For instance, in standard OpenAI gyms such as the BipedalWalker [46] the episode finishes when the robot falls. In simglucose, the episode ends when the BG level goes out of a predefined range. T1D patients should aim for a target range of 70-180 mg/dL [47]. We have configured simglucose to end the episode when BG < 70 mg/dL or BG > 350 mg/dL in order to try to avoid dangerous BG levels. Let us note that this is a quite conservative range. In contrast, in [7], episodes are done when the BG falls below 10 mg/dL or raises over 1000 mg/dL. Reward. A crucial, and problematic, aspect of DRL is the need to design a reward function that helps the agent learn the intended goal [43]. In our solution, we have tried different approaches. In particular, we have tried using the negative of the Clarke BGRI as reward function, in order to keep the BG at the desired level, but in this case negative rewards induce the agent to terminate early, leading to dangerous regimes. Basically, the agent learns to inject more insulin to avoid keeping receiving negative rewards, provoking hypoglycemia. A termination penalty to correct for this behavior is usually introduced, as is done in [7]. From our point of view, this solution is not satisfactory, because the value of the reward at termination becomes effectively another hyperparameter. It requires to be tuned for the expected duration of the episodes. In fact, as an extreme case, since the desired goal is to avoid termination at all, the value should be set to infinity, or at least to a value high enough to counteract the expected lifetime of the patient. We have tried a different strategy: combined with the conservative termination limits mentioned above, we use a simple reward to encourage large episodes, which results in extended periods within adequate BG level regimes. Of course, our reward function can be refined, for instance, considering more sophisticated safe zones. Therefore, the reward is simply: Discount factor. The discount factor helps to balance the importance of immediate and future rewards. Since the effect of insulin on the BG is usually delayed, we set it to a relatively large value of γ = 0.999.
Virtual population. The UVA/PADOVA simulator provides parameters for fully specifying the glucoregulatory system of patients in three groups: children, adolescents, and adults, each category including 10 patients. According to the simglucose developer [48], the patient parameters correspond to the 30-patient subset available for the academic edition of the 2008 commercial UVA/PADOVA simulator. The commercial version provides a virtual population of 100 patients in each group.
Although 30 patients cannot cover all the possible variations in a heterogeneous population, the size of our population is similar to most of the previous works based on RL: according to the systematic review by Tejedor [5], only 3 out of 23 proposals that used in silico patients have over 30 patients. These proposals use 100-patient sets from the UVA/PADOVA simulator. Moreover, ten of the reviewed proposals use just one in silico patient. Unlike other methods, such as model-based classical control (MPC) [14,15], which have an almost negligible computational cost and can be evaluated on thousands of virtual patients, methods based on RL typically use a reduced number of patients. There are two main practical reasons behind these low numbers. First, training RL agents is a highly time-consuming and resource-intensive task. Effective training of a single patient with a set of parameters and hyper-parameters typically takes seven to ten days with our mid-level computing facilities (intel i9-10920X, 64 GB RAM, 2 Nvidia RTX 2080 GPUs). We are able to train 4 to 8 patients in parallel. Second, even if enough time and computing power is available, to train an RL agent, we have to rely on a proven training environment and a set of validated virtual patients. Generating additional patients is possible by sampling from the joint distribution of the model parameters, as described in [9], and variations of this generation method have been used by Di Ferdinando [14] and Borri [15]. However, the values of several parameters were not published, which makes it necessary to guess some of them. Pompa et al have very recently compiled the required parameters for future implementations [49]. Nevertheless, we consider that the effort required to rigorously generate patients is beyond the scope of the current paper.
Thus, for an initial search for a viable implementation strategy for RL, which is the paper goal, we consider that 30 patients split into age groups is a reasonable trade-off. As said, our choice is in line with most of the previous works on this topic and even surpasses most of them. Once a viable implementation strategy has been established, a more comprehensive training campaign can be carried out, including the generation of additional virtual patients.

Initial evaluation
We conducted a series of initial tests to determine the features of the environment that may have a greater influence on the agent learning, according to the initial choices described in the previous section. Simglucose comes with a population of 30 virtual patients: 10 adolescents, 10 adults, and 10 children, which statistically represent different cohorts of patients [19]. We have tested on one patient from each group the initial alternatives that are summarized below and in Table 1.
• Algorithms: PPO and SAC, with a standard configuration using dense DNN with two hidden layers of 256 units. In addition, we used an alternative recurrent architecture intended to capture temporal context, which uses as actor and critic networks a RNN with a 10 LSTM cells. We call this variant PPO-RNN and SAC-RNN. The performance of PPO-RNN and SAC-RNN was similar to the one shown in Table 1 and will not be reproduced.
• Observation space: we used as observation both (O1) the current CGM sample and a (O2) vector of the past 20 CGM samples, corresponding to the previous hour at 3-minute intervals.
• Reward functions: we used (R1) the one point per step reward of Eq (7) and (R2) the negative of the Clarke BGRI with a termination penalty. In all the case we set the termination limits to BG<70 or BG>350.
• Meal schedule: the simulator selects a non-deterministic meal schedule particular to each patient according to the Harris-Benedict algorithm [7].
We used the PPO and SAC implementations from stable-baselines3 [50] and PPO-RNN, SAC-RNN from TensorFlow Agents (TF-Agents [51]) and trained it on two Nvidia GeForce RTX 2080 GPUs and Intel Core i9-10920X CPU @ 3.50GHz 12 cores. All agents were trained for 1 million steps, keeping the best model (best average reward) and with a maximum episode length of 10 000 steps (a step represents 3-minute interval).
From the average length of the episode, it can be seen that the agents are not able to keep a safe control of BG levels beyond 10 hours. In general, PPO is able to achieve longer duration within a safe BG range, because it injects lower insulin levels and patients spend more time in a hyperglycemic state. In fact, most of the episodes with PPO end because of high BG levels. On the contrary, SAC tends to inject insulin aggressively and patients go rapidly to a hypoglycemic state, ending the episode. In both cases, using just the current CGM observation (O1) or the last one-hour interval of CGM observations (O2) has little influence, and using the simple reward (R1) works better than the negative of the Clarke BGRI.

Discussion and refinement of design
From the previous evaluation, we hypothesize that the three main characteristics of the simulation that affect the lack of training success are: (1) The action drives the BG level to lower values, but the rise of BG level depends on the environment dynamics and as consequence of the exogenous input actions, that is, the intake of CHO in the meals. (2) The observation (CGM) is a noisy sample of the BG level. (3) The effect of actions on the BG level is delayed. That is, applying an action does not immediately decrease the BG level. Even though those effects are expected from the initial analysis of the environment, and not particularly surprising, we think they deserve further discussion.
Regarding (1), the agent should learn the policy to deal with this fact, that is, that it does not need to deliver insulin when the BG level is low. In fact, our results show that the PPO agent is able to learn policies that anticipate the meal consumption and the subsequent rise of BG. However, they are not enough precise to control adequately the BG levels. This is probably because of the conservative BG range that leads to an early episode ending. Regarding (2), let us just notice that the reward is usually assigned according to the actual BG level, not the CGM sample, which seems to negatively affect learning since it makes termination penalties appear random. But even using only BG levels, instead of noisy CGM samples, did not improve learning. Regarding (3), both the large discount factor and the recurrent architecture should have improved learning. But the termination limits of the environment, the glucose dynamics and the randomness of meal schedules make it hard to learn:if the agent tries high insulin does, the patient goes very quickly to hypoglycemia, and the episode ends, and when the agent injects low doses, meals, which are not included in the observation, raise the BG level ending also the episode. Therefore, it seems that the combination of all these factors prevents the agent from properly learning. Both (2) and (3) stem from the fact that we are dealing with a POMDP and the recurrent architecture should improve learning, but several variations tried in our tests did not actually improve much. We could have tried changing the RNN architecture and other hyperparameters, but due to the large parameter space, we chose to focus on the delay of actions as follows.
Observation frequency and insulin response. We just configured the environment to decrement the frequency of the observations and actions, instead of using the usual 3 or 5-minute CGM sample resolution. Even though the samples are taken, since the simglucose environment is running and updating the state in mini-steps of 1 min, they are passed less frequently to the agent, which then provides an action. The rationale is simple: to let the agent observe the actual effects of its actions, that is, the actual patient insulin response according to the glucose dynamics, instead of a seemingly random transition. Even though it is similar to using as state a history of the previous CGM data, as is done in [7], it improves the learning process and leads to more adequate insulin regimes in our results. Therefore, we have introduced an additional hyperparameter, observation frequency, which is actually already present in the simulator (CGM sample resolution) although usually left at the default device value (a CGM sample every 3 minutes) [7]. Let us notice that this new hyperparameter does not increase the complexity compared to using a history of previous samples [7], since in that case, the lengths of the history vectors are also additional hyperparameters to be tuned.
We selected the frequency for the observations by testing the insulin response time when injecting a given unit of insulin without taking any meals. This delay is different depending on the patient group, as expected, and it is shown in Fig 3: ten different amounts of insulin doses were used to estimate each subject insulin reaction, from 1 to 30 units. As can be seen, the adult response to insulin tends to be more stable and less pronounced and the reduction in BG starts to show around 45 minutes after injection. Adolescent reaction is slightly slower and more pronounced. Finally, children clearly react faster and more strongly to insulin. In fact, high insulin doses quickly drive some of the patients to hypoglycemia and episode termination. From these observations we chose reducing the frequency of observations as follows: for adults, observations are made every 45 minutes (corresponding to 15 three-minute environment steps), adolescent frequency is set to 30 minutes (10 steps) and children is set to 15 minutes (5 steps).
Then, we started over the training of the SAC, SAC-RNN, PPO and PPO-RNN agents. After running several experiments, only PPO-RNN could learn effective policies. Therefore, we have selected this algorithm and architecture for a full evaluation of performance in the Results section.

Summary of implementation strategy
We provide here a brief summary of our implementation strategy, before conducting an evaluation and comparison of alternatives in the next section.
From the discussion in the previous section we derive the following implementation strategy: (1) we reduce the observation frequency of the environment state (CGM) and, hence, rewards passed to the agent, depending on the subject (45 minutes, 1 hour and 15 minutes for adults, adolescents and children, respectively), (2) we set broad termination limits for the episodes, BG 2 [10,1000], to let the agent explore more thoroughly the environment; and (3) we use the simple reward function of Eq (8) below, to force the agent to learn to keep the patients in euglycemia for as long as possible.  Let us note that the reward is accumulated during all the simulation ministeps and then passed to the agent. For example, if we set the observation frequency to 10, the environment is going to simulate 10 ministeps before passing the sample to the agent, and, if BG level has been in the desired range all those ministeps, the accumulated reward passed will be 10.

Experimental setup
Training and evaluation. Our goal is to keep the patient BG level in the selected range for as long as possible. We have trained a PPO-RNN agent for each of the patients with the implementation strategy summarized in the previous section and the hyperparameters listed in Table 2. During the training, there are periods of instability, where the average reward drops, as reported in other studies [7]. Rigorous convergence of the RL algorithms tested in this paper, SAC and PPO, including their variants with RNN, has not been proved analytically, in general, to the best of our knowledge. Only recently, the convergence of PPO to a local minimum of the associated losses has been proved [52]. In practice, the convergence of the algorithms is assumed when the learning curve does not improve over time. In our case, both PPO and SAC tend to show an oscillatory behavior in the learning curves so that the learning curve increases and then drops. PPO is able to recover from this behavior and we stop the training process when the learning curve has stabilized. The reason for these oscillations is likely the presence of unbounded exogenous stochastic inputs, that is, the meals or the noisy observations. We save the policy every 100 training steps and select the policy with a highest average reward as a trained agent. Once trained, the agents are evaluated 20 times with different seeds for all the patients, and statistics for episode length, fraction of time in glycemia states (eu, hypo and hyper) and other metrics are collected. For evaluation, we also set the environment termination limits to BG = 10 and BG = 1000, in order to test the fraction of time that the patients spend in the different states and make them comparable to similar proposals [7]. Let us remark that patients whose BG reaches levels below or above those limits are considered events that result in serious damage or death.

Baselines
We have compared our results with four baselines: a basal-bolus regime (BB), that simulates the usual self-managed treatment for patients with both T1D and T2D, basal-bolus with cooldown, a PID, and PID with insulin feedback baselines.

Basal-bolus Baseline (BB).
A multiple daily injection therapy which involves using longacting insulin with a dosage of basal: where u2ss is the steady state insulin rate per kilogram pmol/L × kg; and short or rapid-acting insulin (bolus) to regulate blood glucose concentration with bolus: where CF is a correction factor, t is the time between samples and CR is a carbohydrate ratio [7].
To obtain a more stable regime, an alternative is to apply a cooldown signal to the basalbolus insulin delivery policy (BB-CD) to ensure that each meal is corrected only once: where cooldown is 1, if the patient has not had meals in the past three hours and otherwise is 0. Let us note that this treatment requires the patient to be aware of the meal intake and so it is not a closed-loop control. The controls that require explicit knowledge of meal intake are usually called controls with meal announcement in the literature.

PID baseline (PID).
A closed-loop control which uses a discrete PID controller aims to set the system output to a given target, s t , by setting the control variable a k as a linear combination of three terms: where P(s k ) = s k − s t , Iðs k Þ ¼ P k i¼0 ðs i À s t Þ and D(s k ) = |s k − s k−1 |. We use the optimal values for the PID parameters, K p , K d , K i , for each patient provided by Fox et al. [7]. In fact, insulin in blood suppresses the next insulin production, called insulin feedback. Thus, we introduced PID control with insulin feedback (PID-IF) based on [11], where γ is the degree of suppressed insulin delivery by the current plasma insulin, which is equal 0.5, K pi is the normalized insulin concentration in units. K pi is equal to 1, and I p (t) is the model of pharmacokinetics of insulin adapted from [12], given by where the parameter I B is the insulin injected in the previous action, and τ 1 and τ 2 are time constants (in minute) associated with the subcutaneous absorption of insulin equal to 55 and 70, respectively.

Episode length
First, we examine the average episode length in evaluation. Table 3 shows the average fraction of episodes that were completed by each patient group. Note that an episode is terminated when the BG level goes out of the 10-1000 mg/dL range, which means that the patient has reached a BG level that may result in serious damage or death. A primary goal, therefore, of any control method is to avoid early episode termination. Almost all control methods are able to make all the patients finish the 10-day evaluation simulations for all the groups except for BB. The adult group is the easier to control and all patients. PPO-RNN, BB-CD, PID and PID-IF were able to finish the 10-day evaluation period for all the 20 simulations. On the contrary, the children are the most difficult group and BB can only reach on average 78% of full episodes, that is 7.8 days. The introduction of a cooldown (BB-CD) improves basal-bolus overall episode length. But being able to finish the evaluation period is not enough to determine the quality of the treatment: BG levels of patients must be kept in the desired range for as long as possible. In the following sections, we examine how the controls keep the state of the patient during that period.

Risk index and glycemic states
We compare the results of BB, BB-CD, PID, PID-IF and PPO-RNN controllers for the risk index and fraction of time spent in hyper, hypo and euglycemia. The aim is to determine how well controllers regulate the risk of hypoglycemia and hyperglycemia. First, Fig 4a shows with boxplots the distribution for all the methods evaluated. As can be seen, PPO-RNN makes all patients spend more time in an euglycemic state than the baselines, which is the actual goal of this mechanism. Both the median and 25 and 50 percentiles are above those of the other methods. In addition, PPO-RNN also outperforms the baselines globally in terms of the fraction of time spent in hyperglycemia and hypoglycemia. It is instructive to remark how the distributions are more informative in this case that single point estimates, such as the median or mean. For instance, even though the median for the hypoglycemic fraction is similar for PPO-RNN and PID, we can see that a remarkable number of patients spent an unacceptably large fraction of time in hypoglycemia with all PID variants.
In Fig 4b we show that PPO-RNN also outperforms the baselines when the patient groups are examined separately. The results for children are especially noteworthy, since, as we have discussed, it is the most difficult group to train. For that group, we remark that: (1) PPO is able to perform especially well to avoid hypoglycemia, unlike PID, which fails clearly in this aspect; and (2) BB seems to provide reasonable results for children. However, for BB, the results of the previous section have to be taken into account, that is, that the control is only able to reach 78% of the full episode on average. In other words, the control may provide a good response on average for typical BG levels and meal intakes, but not be able to react to unusual variations, which drive the patient to a dangerous state. On the other hand, BB-CD is able to practically eliminate hypoglycemia, but at the cost of a much higher hyperglycemia for all groups. Therefore, to better assess the results, it is necessary to look also at the risk index, which informs us whether the patient is at a safe level within the desirable range. For instance, a patient may spend a large fraction of the time in the euglycemic range but with BG levels very close to the hyper or hypoglycemia thresholds, which may make it vulnerable to unusual conditions, such as irregular meal intakes. Recall that RI penalizes more hypoglycemia, because even though both hypoglycemia or hyperglycemia can lead to fatal outcomes, the short-term effects of hypoglycemia can cause T1D patients to have an immediate crisis [37], as opposed to hyperglycemia, whose effects manifest in the long term. We show the average RI, HBGI and LBGI all over the evaluation period globally in Fig 5a and by groups in Fig 5b. In both cases, PPO-RNN outperforms the other controls and keeps all the RI metrics within reasonable levels, unlike the baselines, especially PID and BB-CD, which show high RI metrics. These results show that PPO-RNN keeps the patients within safe limits most of the time, unlike the baselines, which do not success in smoothly control BG levels: Even though the patients may be euglycemic, they exhibit less safe BG levels. This is the reason why, combined with poor adaptability, some BB-controlled patients are not able to finish the 10-day episodes.

Discussion
The Results section shows that our implementation strategy can effectively control BG levels and consistently outperforms the baselines. A first result to remark is that our tests showed that no patients controlled by PPO-RNN terminated earlier in any of the evaluation trials. Early terminations are called catastrophic failures by Fox [7], since they signal potentially fatal BG levels. Our work in this aspect shows slightly better results than Fox [7], which shows evaluation failures of around 0.1%, but their evaluation extended to 100 10-day replications per patient, whereas ours has been limited to 20 10-day replications. On the other hand, we define a failure when BG goes below 10 mg/dL, whereas they apply an even lower threshold of 5 mg/ dL. PPO-RNN clearly improves over Yamagata [13], which reports very high failure percentages, with some patients completely unable to finish any episode. Of course, the critical nature of BG control requires exhaustive evaluation of results. Closed-loop controllers must be evaluated by its capability to keep the BG within an acceptable range for years. As a future work we will evaluate for extended time periods, but, since our tests show that patients keep BG levels above 50 mg/dL and below 400 mg/DL in all the replications so far, we are confident PPO-RNN can avoid failures for longer episodes.
Attending to the time spent in euglycemia, our results are in line with the ones reported by Fox [7], but outperforms those of Lim [21] and Yamagata [13]. Regarding the former, the implementation strategy, as well as ours, is able to keep euglycemia over 73% of time globally, and, in our case, also for all groups. Our implementation strategy can be considered simpler than the Fox one because our observation space is unidimensional (last CGM sample), which makes training more efficient, whereas Fox uses 96 dimensions (previous 48 CGM and insulin data samples). The extended observation frequency for each patient group is an additional hyperparameter to be optimized in our case, but the choice of 48 previous samples is also a hyperparamter in their case. Lim and Yamagata papers show only marginal improvements over the baselines using different strategies, and in both cases only report 64% of time in euglycemia. Moreover, in all the discussed papers, the children group shows more difficulties to be appropriately controlled with a closed-loop controller. In fact, in the work by Lim [21] the children group is not evaluated at all, despite using a fairly sophisticated control involving DRL driven by PID as initial policy. Similarly, in [13] only three patients of each group were evaluated.
The evaluation discussed in this proposal and the aforementioned works show that the application of DRL to BG control is a challenging task. Our results show, however, that DRL has a great potential as a closed-loop controller. Proposals put forward so far discuss just a tiny fraction of the space of potential implementation strategies for DRL-based BG control. The fact that the employed ones at the moment are relatively simple but exhibit a performance better than PID and PID-IF, encourages further exploration of this approach. In our case, there is a great margin for improvement. As a future work, we plan to optimize the hyperparameters of our method and test variations in the RNN architecture. We think that changes in the reward function, narrowing the desired range to obtain reward, may help improve the euglycemia fraction. The introduction of an extended observation frequency in our case has been the key to make agents learn effectively. It is introduced as a new hyperparameter, which may be optimized, but we plan to try strategies to make agents learn it. Once the strategy has been optimized, we plan to conduct a thorough analysis of the way the agent applies insulin doses and compare it with those prescribed by clinical practitioners. The goal is to find whether trained policies employ unusual patterns of dosage that may help clinical practice. In any case, we acknowledge a slight limitation of our approach: the small size of the virtual population of patients, which cannot completely reflect a heterogeneous population. As discussed in previous sections, this is a common drawback of current RL proposals, due to demanding time and computation requirements of this method, and specially, the lack of validated patient sets. Since an exhaustive compilation of the required parameters have been recently published [49], we plan to generate and validate additional sets of virtual patients. Finally, as alternative implementation strategies, it should be worth trying algorithms designed specifically to deal with POMDP and input-dependent environments [23,33].
As a final note, it is important to consider how to practically apply RL-based controls in real patients. Only a proof-of-concept approach can be considered, due to the difficulties for real application. That is, training an agent requires experimentation (insulin injection) on the subject to learn the optimal control, which is out of the question for real patients. It can only be tested on virtual patients and how to transfer it to real patients is a difficult matter [53]. However, our results may be used with more realistic approaches, such as offline RL [54]. With this novel method, the agent, usually a neural network with a transformer architecture, is trained on previously collected datasets, without direct experimentation on the subject. Those datasets may correspond to a series of BG levels and insulin doses collected from real patients, but they may also come from simulations on virtual patients, and both can be combined. Surprisingly, agents trained this way may show better performance than the original methods [55,56], especially if the datasets contain high-reward regions of the state space. Therefore, the availability of methods to generate very diverse datasets for further use as input, combined with additional data sources, is importante in offline RL. The agents proposed here can actually generate such high-rewards datasets.

Conclusion
In this work, we propose a closed-loop BG level control based on DRL. We discuss the particular characteristics of a realistic simulator of the glucoregulatory system as a training environment for DRL agents and the complexity of their training in this environment. Effective training of such agents can be achieved by very different design choices for the learning process, which we call the implementation strategy.
We describe the initial evaluation of several alternatives conducted on a T1DM UVA/ PADOVA simulator and, based on the results, propose a particular implementation strategy based on reducing the frequency of the observations and rewards passed to the agent, and using a simple reward function. PPO-RNN agents are trained with that strategy for three groups of patients, evaluated and compared with PID, PID-IF, BB, BB-CD baselines. Our system is able to outperform common PID and BB strategies in overall terms, attending to healthy glycemic states and the risk index. A critical discussion of the results and a comparison with several recent works is provided, indicating that our system outperforms current solutions at a lower computational cost. Euglycemia is maintained in 73% of the time, and no early termination events (BG out of range) are reported. Hence, our results show DRL as a promising methodology for implementing closed-loop BG control.
There is still large space of potential strategies and margin for improvement in future research. In particular, as a future work, in addition to variations on our current strategy, we consider the use of novel algorithms for POMDP and exogenous stochastic input actions. In our continued work, we plan to consider offline RL and investigate it as a T1D plasma glucose controller.